We have a list of header names (e.g. "REFERENCENUMBER", "document_id", "buyer", ...) and want to determine which of them belong to a given entity (e.g. "buyer" or "award"). In general, an entity is composed of multiple header names, and multiple entities can be contained in a single set of header names.
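As a small illustration of this many-to-many relationship (the header and entity names below are invented, not taken from the real sample data):

```python
# Made-up example: a single header set containing headers from two entities.
header_set = ['REFERENCENUMBER', 'BUYER_NAME', 'BUYER_ADDRESS', 'AWARD_DATE']

# Hypothetical ground-truth assignment of headers to entities.
entity_of = {
    'BUYER_NAME': 'buyer',      # the 'buyer' entity is composed of
    'BUYER_ADDRESS': 'buyer',   # multiple header names ...
    'AWARD_DATE': 'award',      # ... and the same set also contains 'award'
}
entities_present = sorted(set(entity_of.values()))
print(entities_present)  # both entities occur in this single header set
```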
The classification algorithm should accomplish two things:
In [158]:
from words import split_words
In [185]:
import samples
reload(samples)  # pick up any changes made to the samples module
#load_samples_by_entity returns a dict mapping each entity type to all
#header name combinations observed for that entity.
samples = samples.load_samples_by_entity(["Keywords", "UK", "Georgia", "Canada", "Mexico", "EU"], cache=True)
print "Entity types:", ", ".join(samples.keys())
In [186]:
#Example: header name combinations for 'buyer' entity
print dict(samples)['buyer']
For the classification, we first use a DictVectorizer to generate a sparse feature matrix from the header name occurrences. Then, we use a linear support vector classifier to classify the headers.
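To make the vectorization step concrete, here is a minimal, self-contained sketch (with invented header names, written against the current scikit-learn API rather than the notebook's older one) of how DictVectorizer turns occurrence dicts into a sparse matrix:

```python
from sklearn.feature_extraction import DictVectorizer

# Two invented occurrence dicts, one per header set.
occurrences = [
    {'REFERENCENUMBER': 1, 'buyer': 1},
    {'document_id': 1, 'buyer': 1},
]
vec = DictVectorizer()
X = vec.fit_transform(occurrences)  # sparse matrix: one column per header name
print(X.shape)       # (2, 3): two samples, three distinct header names
print(X.toarray())   # rows are samples, columns are 0/1 header indicators
```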
In [188]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction import text,DictVectorizer
pipe = Pipeline([
    ('hv', DictVectorizer()),
    ('svm', LinearSVC()),
])
In [190]:
#We generate the training data set.
#For each entity type and sample, we build a dict recording which header
#names occur, paired with the entity type as the label.
counts = []
entities = []
for entity, headers_list in samples.items():
    for headers in headers_list:
        counts.append(dict((header, 1) for header in headers))
        entities.append(entity)
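Run on a toy samples dict (made-up entity and header names, standing in for the real loaded data), the loop above yields one occurrence dict per header combination, with a parallel list of labels:

```python
# Toy stand-in for the dict returned by samples.load_samples_by_entity.
toy_samples = {
    'buyer': [['BUYER_NAME', 'BUYER_ID'], ['buyer']],
    'award': [['AWARD_DATE', 'AWARD_VALUE']],
}
counts, entities = [], []
for entity, headers_list in toy_samples.items():
    for headers in headers_list:
        counts.append({header: 1 for header in headers})
        entities.append(entity)
print(len(counts), len(entities))  # 3 occurrence dicts, 3 matching labels
```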
In [269]:
#We fit the counts to the entities
pipe.fit(counts,entities)
#We predict the entity type of a given header name, alongside one of the
#training labels for comparison
pipe.predict([{'NOTICETYPE': 1}]), entities[30]
Out[269]:
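For a self-contained version of this fit/predict round trip (again with invented training data, and modern print syntax), the pipeline can be exercised like this:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([('hv', DictVectorizer()), ('svm', LinearSVC())])

# Invented training data: each dict marks which header names occur.
counts = [{'BUYER_NAME': 1}, {'buyer': 1}, {'AWARD_DATE': 1}, {'AWARD_VALUE': 1}]
entities = ['buyer', 'buyer', 'award', 'award']

pipe.fit(counts, entities)
print(pipe.predict([{'AWARD_DATE': 1}]))  # predict for a header seen in training
```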
In [274]:
from sklearn import cross_validation
rs = cross_validation.ShuffleSplit(len(counts), n_iter=2, train_size=0.75, test_size=0.25)
for train_index, test_index in rs:
    print 'train', train_index, '\ntest', test_index
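Note that the sklearn.cross_validation module has since been removed; in current scikit-learn the same splitter lives in sklearn.model_selection, takes n_splits instead of n_iter, and yields indices via .split(X). A sketch of the equivalent:

```python
from sklearn.model_selection import ShuffleSplit

# Modern equivalent of cross_validation.ShuffleSplit(len(counts), n_iter=2, ...).
rs = ShuffleSplit(n_splits=2, train_size=0.75, test_size=0.25, random_state=0)
X = list(range(8))  # stand-in for the counts list
for train_index, test_index in rs.split(X):
    print('train', train_index, '\ntest', test_index)
```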
In [275]:
#Performance of this model is poor because the sample size is very small.
for train_index, test_index in rs:
    pipe.fit([counts[i] for i in train_index], [entities[i] for i in train_index])
    print pipe.score([counts[i] for i in test_index], [entities[i] for i in test_index])
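The manual fit/score loop can also be written with cross_val_score. A self-contained sketch on invented, perfectly separable toy data (so the scores here are trivially perfect, unlike the real small-sample case):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([('hv', DictVectorizer()), ('svm', LinearSVC())])

# Invented data: two classes with disjoint header names.
counts = [{'BUYER_NAME': 1}] * 4 + [{'AWARD_DATE': 1}] * 4
entities = ['buyer'] * 4 + ['award'] * 4

cv = ShuffleSplit(n_splits=2, train_size=0.75, test_size=0.25, random_state=0)
scores = cross_val_score(pipe, counts, entities, cv=cv)
print(scores)  # perfect scores on this trivially separable toy data
```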